Energy consumption is a critical concern worldwide due to its impact on the environment, economy, and human welfare. Therefore, understanding the factors that influence energy consumption in buildings is essential to optimize energy use and minimize its negative effects. Multiple linear regression is a statistical method used to model the relationship between a dependent variable and several independent variables simultaneously. In this report, we perform a multiple linear regression analysis to investigate the factors that affect energy consumption. The analysis is based on a dataset that includes information on natural gas consumption and several variables related to weather conditions (such as the mean external temperature and the irradiance). The objective of this study is to identify the significant predictors of energy consumption and provide insights into the underlying mechanisms that drive energy use.
The report is organized in the following sections:
The dataset used in this analysis consists of three numerical variables, total daily gas consumption Energy \([Smc]\), mean daily external temperature Text \([°C]\), and mean solar irradiance Iext \([W/m^2]\), and one categorical variable, the day of the week DayOfTheWeek.
The dataset provides daily measurements of these variables for a full heating season in Turin, which runs from \(1^{st}\) November to \(31^{st}\) March, resulting in a total of 151 records.
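As a quick sanity check on the record count, the number of days in the heating season can be recomputed. This is a minimal Python sketch, not the code behind the report; the years 2017 and 2018 are taken from the date range shown in the dataset summary.

```python
from datetime import date

# Heating season boundaries (years taken from the dataset summary).
season_start = date(2017, 11, 1)
season_end = date(2018, 3, 31)

# Both the first and the last day are measured, hence the +1.
n_records = (season_end - season_start).days + 1
print(n_records)  # 151
```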
The table below shows an excerpt of the dataset.
The trend of the variables during the heating season is represented in the figure below.
For the following steps, it is useful to summarize the dataset in terms of descriptive statistics and distributions:
##       date             DayOfTheWeek       Text              Iext
##  Min.   :2017-11-01   Min.   :1.00   Min.   :-5.950   Min.   :  0.50
##  1st Qu.:2017-12-08   1st Qu.:2.00   1st Qu.:-0.115   1st Qu.:  3.48
##  Median :2018-01-15   Median :4.00   Median : 2.920   Median : 34.34
##  Mean   :2018-01-15   Mean   :4.04   Mean   : 3.103   Mean   : 41.23
##  3rd Qu.:2018-02-21   3rd Qu.:6.00   3rd Qu.: 6.605   3rd Qu.: 71.47
##  Max.   :2018-03-31   Max.   :7.00   Max.   :11.610   Max.   :182.10
##      Energy        day_name
##  Min.   :  0.0   Length:151
##  1st Qu.:257.1   Class :character
##  Median :389.2   Mode  :character
##  Mean   :382.2
##  3rd Qu.:556.2
##  Max.   :676.8
In this section, an outlier detection procedure is applied to identify observations that lie far enough from the bulk of the data to be considered anomalous. Since we are working with multivariate data, the Mahalanobis distance (MD) is a natural choice, because it accounts for the correlations between the variables. It is important to note that the Mahalanobis distance is sensitive to the distribution of the data and to its covariance structure. Therefore, it is recommended to check the normality assumption before applying the Mahalanobis distance method for outlier detection.
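For reference, the standard definition of the squared Mahalanobis distance is given below. The symbols \(\mathbf{x}_i\), \(\bar{\mathbf{x}}\), \(\mathbf{S}\) and \(p\) are notation introduced here for illustration; they do not appear in the original analysis.

\[
D_M^2(\mathbf{x}_i) = (\mathbf{x}_i - \bar{\mathbf{x}})^{\top}\, \mathbf{S}^{-1}\, (\mathbf{x}_i - \bar{\mathbf{x}})
\]

Here \(\mathbf{x}_i\) is the vector of the \(p\) variables measured on record \(i\), \(\bar{\mathbf{x}}\) is the sample mean vector, and \(\mathbf{S}\) is the sample covariance matrix. If the data are multivariate normal, \(D_M^2\) approximately follows a chi-square distribution with \(p\) degrees of freedom, which is why the normality assumption matters when a chi-square cutoff is used to flag outliers.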
To check for multivariate normality we can employ Mardia's test, which extends the univariate skewness and kurtosis tests to the multivariate case. The test is based on the multivariate skewness and kurtosis of the data. The multivariate skewness measures the degree of asymmetry in the distribution, while the multivariate kurtosis measures how heavy its tails are. If the dataset is approximately multivariate normal, the multivariate skewness should be close to zero and the multivariate kurtosis close to \(p(p+2)\), where \(p\) is the number of variables (15 for \(p = 3\)).
The null hypothesis of Mardia’s test is that the data follows a multivariate normal distribution. If the p-value of the test is below a chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that the data is not multivariate normal.
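Mardia's coefficients and the associated test statistics can be written as follows. This is the textbook formulation, given here as a reference sketch; the notation (\(\mathbf{x}_i\), \(\bar{\mathbf{x}}\), \(\mathbf{S}\), \(n\), \(p\)) is introduced for illustration, and \(\mathbf{S}\) is the covariance matrix computed with divisor \(n\).

\[
b_{1,p} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ (\mathbf{x}_i - \bar{\mathbf{x}})^{\top} \mathbf{S}^{-1} (\mathbf{x}_j - \bar{\mathbf{x}}) \right]^3,
\qquad
b_{2,p} = \frac{1}{n} \sum_{i=1}^{n} \left[ (\mathbf{x}_i - \bar{\mathbf{x}})^{\top} \mathbf{S}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}) \right]^2
\]

\[
\frac{n\, b_{1,p}}{6} \sim \chi^2_{p(p+1)(p+2)/6},
\qquad
\frac{b_{2,p} - p(p+2)}{\sqrt{8p(p+2)/n}} \sim \mathcal{N}(0, 1)
\]

Both distributions hold asymptotically under the null hypothesis. A large skewness statistic or a kurtosis statistic far from zero in either direction is evidence against multivariate normality.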
Performing the test we obtain:
##           Beta-hat      kappa     p-val
## Skewness  5.095078 128.226121 0.0000000
## Kurtosis 16.126710   1.263892 0.2062688
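The figures in this table can be cross-checked from the reported Beta-hat values alone. The sketch below is in Python, not the R code behind the report, and it assumes that the test was run on the 151 records and the three numerical variables (\(n = 151\), \(p = 3\)). The helper `chi2_sf_even` is a hypothetical name, and its closed form is valid only for an even number of degrees of freedom.

```python
import math

n, p = 151, 3                  # assumed: number of records and of numerical variables
b1, b2 = 5.095078, 16.126710   # Beta-hat values from the table above

# Skewness: n * b1 / 6 is asymptotically chi-square with p(p+1)(p+2)/6 dof.
skew_stat = n * b1 / 6
skew_df = p * (p + 1) * (p + 2) // 6

# Kurtosis: standardized difference from the value p(p+2) expected under normality.
kurt_z = (b2 - p * (p + 2)) / math.sqrt(8 * p * (p + 2) / n)

def chi2_sf_even(x, df):
    """Upper-tail chi-square probability, closed form valid for even df only."""
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

skew_p = chi2_sf_even(skew_stat, skew_df)
kurt_p = math.erfc(abs(kurt_z) / math.sqrt(2))  # two-sided standard normal p-value

print(f"skewness: kappa={skew_stat:.4f}, dof={skew_df}, p-value={skew_p:.3g}")
print(f"kurtosis: kappa={kurt_z:.4f}, p-value={kurt_p:.4f}")
```

Under these assumptions the recomputed kappa values and p-values should agree with the table up to rounding of the Beta-hat inputs.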
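As can be observed, the p-value of the skewness test is below 0.05, so we reject the null hypothesis and conclude that the distribution is not multivariate normal, even though the kurtosis test alone (p-value 0.206) shows no significant departure. For this reason, the variables are standardized with the scale function before proceeding. Note that standardization centers and rescales each variable but does not change the shape of its distribution, so the departure from normality remains and the results of the outlier detection should be interpreted with caution.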
After standardization, we can compute the Mahalanobis distance of each record and apply a chi-square test to flag the outliers. The following dates are flagged:
## [1] "2017-12-10" "2017-12-17" "2017-12-24" "2018-01-07" "2018-01-14"
## [6] "2018-01-28" "2018-02-04" "2018-02-11" "2018-02-18" "2018-02-20"
## [11] "2018-03-29"
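The procedure described above (Mahalanobis distance followed by a chi-square test) can be sketched as follows. This is a minimal Python illustration, not the R code behind the report: the rows in `rows` are invented values, not taken from the dataset, the significance level `ALPHA` is an assumption because the report does not state the level used, and the helper names are hypothetical. The chi-square tail formula is the closed form for 3 degrees of freedom only.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def mahalanobis_sq(data):
    """Squared Mahalanobis distance of every row from the sample mean."""
    n, p = len(data), len(data[0])
    mean = [sum(r[j] for r in data) / n for j in range(p)]
    centered = [[r[j] - mean[j] for j in range(p)] for r in data]
    # Sample covariance matrix (divisor n - 1).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    return [sum(d * y for d, y in zip(c, solve(cov, c))) for c in centered]

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    cdf = math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    return 1.0 - cdf

# Hypothetical records: (Text [°C], Iext [W/m^2], Energy [Smc]).
rows = [
    (-2.0, 10.0, 610.0), (0.5, 35.0, 520.0), (3.0, 60.0, 430.0),
    (1.0, 5.0, 540.0), (6.5, 90.0, 300.0), (4.0, 20.0, 410.0),
    (8.0, 120.0, 250.0), (2.5, 45.0, 30.0), (-1.0, 25.0, 580.0),
    (5.0, 70.0, 360.0), (9.5, 150.0, 200.0), (7.0, 40.0, 310.0),
]
ALPHA = 0.05  # assumed significance level

d2 = mahalanobis_sq(rows)
flagged = [i for i, d in enumerate(d2) if chi2_sf_3df(d) < ALPHA]
print("flagged row indices:", flagged)
```

A record is flagged when the chi-square upper-tail probability of its squared distance falls below `ALPHA`. Because the distance already rescales the data through the covariance matrix, centering and rescaling the columns beforehand leaves the distances unchanged.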